Visual Intelligence in Precision Agriculture: Exploring Plant Disease Detection via Efficient Vision Transformers

In order for a country’s economy to grow, agricultural development is essential. Plant diseases, however, severely hamper crop growth rate and quality. In the absence of domain experts, and given the low-contrast visual symptoms involved, accurate identification of these diseases is challenging and time-consuming. Agricultural management systems therefore need a method for automatically detecting disease at an early stage. CNN-based models rely on pooling layers for dimensionality reduction, which discards vital information, including the precise location of the most prominent features. In response to these challenges, we propose a fine-tuned technique, GreenViT, for detecting plant infections and diseases based on Vision Transformers (ViTs). Analogous to word embedding, we divide the input image into smaller blocks or patches and feed these to the ViT sequentially. Our approach leverages the strengths of ViTs to overcome the problems associated with CNN-based models. Experiments on widely used benchmark datasets were conducted to evaluate the performance of the proposed GreenViT. The obtained experimental outcomes show that the proposed technique outperforms state-of-the-art (SOTA) CNN models for detecting plant diseases.


Introduction
Over the past few decades, agriculture has emerged as the primary source of income for several countries, contributing significantly to the global economy. As per the World Bank report of 2018, agriculture employed over a billion people, representing 28.5% of the total labor force, and produced about 10 million tons of food a day [1]. However, the full potential of agriculture is undermined by plant infections and diseases, which compromise food security. Major food crops, such as rice, wheat, potatoes, soybeans and maize, can suffer losses of 10% to 40% due to plant viruses [2]. Addressing these challenges necessitates frequent inspection for disease symptoms, which is often inefficient and time-consuming, particularly for large crop fields. Achieving precision agriculture therefore requires effective detection of plant infections. The proliferation of Machine Learning (ML) has motivated research groups to explore its potential for automating the detection of plant diseases from images obtained in the field, analyzing the images to extract significant features. For example, in Ref. [3], a Support Vector Machine (SVM) was applied after extracting image features using the Scale Invariant Feature Transform (SIFT) to classify guava leaf diseases. Transformer-based models, however, have limitations of their own. In the last layers of the ViT model, the cosine similarity between patch representations increases significantly, suggesting that adding more layers does not enhance model performance [27]. Moreover, the memory requirements of ViT make high-resolution images difficult to handle, since the cost of self-attention grows rapidly with the length of the patch sequence.
Several studies have aimed to overcome the limitations of Transformer-based models, and these efforts generally fall into two categories: hybrid models and pure-Transformer enhancements. Hybrid models combine the strengths of CNNs and Transformers to improve performance. For example, a CNN-based model called the Ghost-Enlightened Transformer was proposed by [28] to construct intermediate feature maps, after which the self-attention mechanism converts those maps into deep semantic features. On 12,615 images collected by the authors, this model achieved 98.14% accuracy. A similar system, PlantXViT, is outlined in [29]; it incorporates a VGG16 network, an inception block and a Transformer encoder layer. The VGG16 and inception blocks capture local image features better than currently available SOTA CNN models. Furthermore, multiple studies have incorporated CNN layers into Transformer architectures to strengthen the extraction of the most prominent features [30][31][32]. This approach makes the model more accurate, because it can learn local features through the CNN architecture, but training and inference times are significantly extended and memory consumption is high.
In contrast, pure-Transformer enhancements primarily optimize the self-attention mechanism to improve performance. The Swin Transformer, for example, computes local attention efficiently within shifted windows while maintaining connections across windows [33]. Ref. [34] developed k-NN attention, which builds the attention matrix from the top-k related tokens among the keys, thereby reducing training time. RegionViT employs local self-attention and retains global information through a regional-to-local concept [35]. Several studies have also proposed modifying the self-attention mechanism by using feature channels instead of tokens when computing the self-attention matrix [36], or revamping the spatial attention mechanism to include small-distance, large-distance and all-inclusive information [37]. This line of work optimizes the attention matrix computation to decrease model complexity while maintaining global connectivity. Some studies, however, keep the original self-attention architecture, which leads to a huge number of trainable parameters in each self-attention head compared with the approaches above. Existing Transformer-based models thus remain complex while growing larger. These limitations hinder their application to intelligent edge devices, such as drones and single-board computers, where resources are limited. We designed our model so that it can be deployed and operated on resource-constrained products, with the aim of minimizing transmission latency and network bandwidth consumption [38]. In summary, this study makes the following contributions:

1.
Plant disease detection has improved significantly with CNN-based models, according to the latest research findings. However, these models exhibit limitations such as translation invariance, locality sensitivity and a lack of global image comprehension. To address these shortcomings inherent in CNN-based approaches, this study introduces a new approach utilizing a Vision Transformer-based model for improved and effective plant disease classification.

2.
Drawing inspiration from the Vision Transformer (ViT) proposed by Dosovitskiy et al. [26], we trained and fine-tuned the ViT model specifically for plant disease detection, achieving notable advancements over the SOTA CNN models. Improving the architecture of the ViT model made it possible to reduce the number of learnable parameters from 86 million to 21.65 million through the fine-tuning process, while simultaneously increasing the accuracy of the model.

3.

The proposed GreenViT model exhibits exceptional accuracy and effectively reduces the occurrence of false alarms. Consequently, the developed system proves to be highly suitable for accurate plant disease detection, ultimately mitigating the risks associated with food scarcity and insecurity.
The remainder of this paper is organized as follows: Section 2 presents the proposed methodology, outlining the key steps and techniques employed in the study. Section 3 describes the experimental results obtained. Finally, Section 4 concludes the paper, summarizing the main findings and contributions and outlining directions for future work.

Material and Methods
This section begins by introducing the experimental datasets used in the study. Subsequently, the plant disease detection model, named GreenViT, is presented. Finally, the experimental environment and the evaluation metrics used to assess the model's performance are reviewed.

Datasets
To gauge the effectiveness of the proposed model, the study utilized two popular standard datasets, namely PV and DRLI. Furthermore, to test the model's resilience, a new dataset called the PC dataset was created by integrating the two. The combined datasets' statistics are listed in Table 1, while comprehensive details are provided below. The PV dataset has been widely utilized in previous studies due to its large size, public availability and free access to data on crop leaf disease classification. To validate the classification accuracy of the employed approach, the authors carried out several experiments on this dataset, which comprises images of plants with various types of diseases. The dataset contains a total of 54,303 images from 14 plant species, categorized into 38 classes, of which 26 correspond to infected plants and 12 to healthy plants. The dataset includes images of plants such as tomatoes, strawberries, grapes and oranges. In addition to variations in color, size and lighting, the dataset features image distortions such as noise, blurring and color shifts, making it a challenging benchmark for detecting and categorizing affected plant leaf regions.

Data Repository of Leaf Images
The intricate interaction between plants and their surroundings leads to the production of various substances that enrich the environment and help control greenhouse gases and climate change. In the past, however, humans have ruthlessly exterminated many plant species, resulting in a loss of biodiversity and further exacerbating climate change. Against this background, the identification, detection and diagnosis of plant diseases have become crucial. For this dataset, the authors chose twelve plant species, including guava, arjun, mango, alstonia scholaris, bael, jatropha, jamun, pomegranate, basil and lemon. The leaves of these plants were photographed in both healthy and infected states and divided into two categories: healthy and infected. The entire dataset contains 4503 images, with 2278 healthy leaves and 2225 diseased leaves, captured from March to May 2019 at Shri Mata Vaishno Devi University in Katra. The dataset was divided into 22 subject groups based on plant species, and the photographs were captured in an enclosed space using a Nikon D5300 camera (Nikon, Tokyo, Japan) with an 18-55 mm lens and sRGB color representation. The photos were taken at ISO 1000 without flash, yielding a single JPEG photo in 0.58 s per frame and a RAW + JPEG photo in 0.63 s per frame.

Plant Composite
In order to evaluate the robustness of the proposed GreenViT model, the authors conducted an experiment using a combination of the publicly available PV and DRLI datasets. Merging these datasets created a new and more diverse dataset that poses greater challenges for the model. The composite dataset consists of a total of 58,807 images, making it roughly 8% larger than PV and about thirteen times the size of the DRLI dataset. The increased size and diversity of plant species within the dataset necessitated a meticulous training process. As a result, the model demonstrated improved generalization ability and enhanced reliability for real-time plant disease detection scenarios.
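For illustration, the following minimal sketch shows one way two folder-structured image datasets could be merged into a composite pipeline with Keras, the framework used in this work. The directory paths and the one-sub-folder-per-class layout are assumptions for the sketch, not the authors' actual data organization.

```python
import tensorflow as tf

IMG_SIZE = (72, 72)  # input resolution used by GreenViT
BATCH = 32           # batch size used in the experiments

def load_split(root):
    # Expects a root folder with one sub-folder per class
    # (e.g., "healthy"/"infected"); labels are inferred from folder names.
    return tf.keras.utils.image_dataset_from_directory(
        root, image_size=IMG_SIZE, batch_size=BATCH, label_mode="int")

pv = load_split("data/plant_village")      # hypothetical path
drli = load_split("data/leaf_repository")  # hypothetical path

# Concatenating the two pipelines yields a composite (PC-style) dataset.
pc = pv.concatenate(drli).shuffle(buffer_size=1000)
```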

The Proposed GreenViT Plant Disease Detection Method
The proposed framework is outlined thoroughly in this section. A Transformer model forms its foundation. The Transformer is currently regarded as the SOTA architecture for sequential data processing, particularly in Natural Language Processing (NLP) tasks such as speech recognition, language modeling and machine translation. The Transformer architecture, introduced by [40], revolves around an encoder-decoder module that facilitates rearranging and incorporating a given sequence of elements into a new sequence. The primary objective behind the development of Transformers was to enable parallel processing of data. The purpose of this study is to evaluate the performance of the ViT model in predicting plant diseases. As depicted in Figure 1, the ViT architecture takes an input image with dimensions of 72 × 72 pixels. Initially, the input image is divided into patches, and the number of patches depends on the specific scenario being addressed; in this study, the input image is split into patches of size 6 × 6 pixels. To accommodate 2D images with height H, width W and C channels, the image, denoted as X ∈ ℝ^(H×W×C), is reshaped into a sequence structure resembling word embeddings. This transformed representation is then used as input to the Transformer network, which processes the sequence of flattened 2D patches X_P ∈ ℝ^(N×(P²·C)), where (P, P) is the resolution of each patch. The effective sequence length for the Transformer is given by N = HW/P². In the Transformer network, these patches are treated in a similar manner as tokens in NLP. Each layer of the Transformer maintains a fixed width, and a trainable linear projection maps each vectorized patch to the model dimension D; the resulting outputs are referred to as patch embeddings. The ViT model incorporates three main components: the embedding layer, the encoder layer and the classifier layer. These components are discussed in detail below.
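As a worked example of the sequence-length formula, the short snippet below computes N for the 72 × 72 input used here; the 6-pixel patch size is our reading of the configuration, stated as an assumption.

```python
# Worked example of N = H*W / P^2 for a 72 x 72 input,
# assuming a patch size of P = 6.
H = W = 72
P = 6
N = (H * W) // (P * P)
print(N)  # -> 144 patches per image
```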

Embedding Layer
Transformer models treat patches as individual tokens and map them to a higher-dimensional space through learnable linear projections. These embedded projections are then combined with a learnable class token U_Class, which plays a crucial role in the classification process. To preserve positional information and retain the spatial positioning of the patches, a positional embedding E_Position is employed; each patch in the image can be located precisely based on these positional embeddings. The sequence of embedded patches concatenated with the class token, Y_0, is given by Equation (1):

Y_0 = [U_Class; X_P^1 E; X_P^2 E; ...; X_P^N E] + E_Position (1)

where E denotes the trainable linear projection. This equation captures the fusion of the class token U_Class with the encoded patches to form the final input representation for further processing in the model.
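The Keras layer below is a minimal sketch of this embedding step, assuming a 6 × 6 patch size (so a 72 × 72 input yields N = 144 patches). The prepending of the class token U_Class is omitted for brevity, and all names are illustrative rather than the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    # Splits an image into patches, linearly projects each flattened patch
    # to dimension D, and adds a learnable positional embedding E_Position.
    def __init__(self, patch_size=6, projection_dim=64, num_patches=144):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = num_patches
        self.projection = layers.Dense(projection_dim)  # trainable map to D
        self.position_embedding = layers.Embedding(     # E_Position
            input_dim=num_patches, output_dim=projection_dim)

    def call(self, images):
        batch = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1], padding="VALID")
        # Flatten each P x P x C patch into a (P^2 * C)-vector, one per patch.
        patches = tf.reshape(patches, [batch, self.num_patches, -1])
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)
```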

Encoding Layer
In this step, the Transformer encoder processes the sequence of embedded patches, denoted as Y_0. The ViT utilizes a stack of L encoder blocks, each subdivided into two distinct sub-components: Multi-Head Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). The MHSA block is the pivotal component within the encoder block, incorporating self-attention and concatenation layers. Specifically, given an input sequence x = (x_1, x_2, ..., x_n), the Transformer performs an attention operation on a set of queries Q using all available keys K and values V. This process is represented in Equation (2):

Attention(Q, K, V) = SoftMax(Q K^T / √D) V (2)
In Equation (2), the weight matrices W_Q, W_K and W_V are trainable parameters that produce the queries, keys and values, respectively. The process involves calculating the dot product of the queries Q with all keys K, scaling the result by the square root of D and applying a SoftMax function to obtain the attention weights. The Transformer executes multiple parallel instances of scaled dot-product attention with different weights, known as attention heads. The outputs of these attention heads are then merged to produce the final result, as given in Equation (3):

MHSA(Q, K, V) = Concatenate(Attention_1, Attention_2, ..., Attention_n) W_0 (3)

where Attention_i = Attention(Q W_Q^i, K W_K^i, V W_V^i). In Equation (3), W_Q^i, W_K^i, W_V^i and W_0 refer to the trainable parameter matrices. The output of the MHSA block at the l-th layer, with a residual connection, is formulated in Equation (4):

Y'_l = MHSA(LN(Y_{l-1})) + Y_{l-1} (4)

where LN denotes layer normalization.
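The following short sketch implements the scaled dot-product attention of Equation (2) directly from its definition; it is an illustrative function, not the authors' implementation.

```python
import tensorflow as tf

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = SoftMax(Q K^T / sqrt(D)) V, as in Equation (2).
    d = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d)
    weights = tf.nn.softmax(scores, axis=-1)  # attention weights over the keys
    return tf.matmul(weights, V)

# Multi-head attention (Equation (3)) runs several such operations in
# parallel with separate learned projections and concatenates the results;
# Keras provides this as tf.keras.layers.MultiHeadAttention.
```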
The MLP comprises two sequentially connected fully connected layers, after which the ReLU activation function is applied. The output of the MLP sub-block, again with a residual connection, is given in Equation (5):

Y_l = MLP(LN(Y'_l)) + Y'_l (5)

Classification Layer
From the output sequence of the last encoder layer, the first entity Y_L^0 (the class token representation) is extracted and passed to an external head classifier responsible for the final prediction. The head classifier performs classification by assigning the input to one of two class labels: "Healthy" or "Infected". The formulation of this classification step is provided in Equation (6):

y = LN(Y_L^0) (6)
Dosovitskiy et al. [26] proposed three fundamental versions of the ViT, namely ViT-Base, ViT-Large and ViT-Huge, which differ in the number of encoders, hidden dimensions, attention heads and classifier configurations. The ViT-Base variant is trained with a patch size of 16 × 16 and employs 12 encoder layers, a hidden size of 768 and 12 attention heads, while the ViT-Large and ViT-Huge versions are computationally more demanding. For a detailed overview of each version's specifications, please refer to Table 2. During the experiments, the ViT-Base model was fine-tuned with specific configurations: the projection dimension, number of heads, number of Transformer layers and MLP head units were set to 64, 4, 8 and 1024, respectively. Following the MLP head, a SoftMax layer performs the classification, distinguishing between the two classes "Healthy" and "Infected". This tweaking of the ViT-Base model successfully reduces the total number of learnable parameters without compromising overall performance.
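The sketch below assembles these components under the configuration just described (64-d projections, 4 heads, 8 encoder layers, a 1024-unit MLP head, two output classes). It reuses the PatchEmbedding sketch from above; the hidden MLP width inside the encoder and the use of a flattened sequence rather than the class token for classification are simplifying assumptions, not the authors' exact design.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_greenvit_like(num_patches=144, projection_dim=64,
                        num_heads=4, num_layers=8, mlp_head_units=1024):
    # Sketch of a ViT with the configuration reported for GreenViT.
    inputs = layers.Input(shape=(72, 72, 3))
    x = PatchEmbedding(6, projection_dim, num_patches)(inputs)  # sketch above

    for _ in range(num_layers):
        # MHSA sub-block with a residual connection (Equation (4)).
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.MultiHeadAttention(num_heads=num_heads,
                                      key_dim=projection_dim)(h, h)
        x = layers.Add()([h, x])
        # MLP sub-block with a residual connection (Equation (5)).
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.Dense(projection_dim * 2, activation="relu")(h)
        h = layers.Dense(projection_dim)(h)
        x = layers.Add()([h, x])

    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = layers.Flatten()(x)  # simplification: no class-token extraction here
    x = layers.Dense(mlp_head_units, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # "Healthy"/"Infected"
    return tf.keras.Model(inputs, outputs)
```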

Experimental Results
This section describes the experimental setup and performance metrics, followed by the evaluated results, including graphical outcomes. All models, including the proposed GreenViT, were trained for a total of 10 epochs with a low learning rate to ensure the retention of previously acquired knowledge; each pre-trained model continually updated its learning parameters to optimize performance on the designated dataset. Each model was retrained using its default input size of 224 × 224, while the proposed GreenViT used 72 × 72, with a batch size of 32 in all cases. The Adam optimizer was used with a learning rate of 1 × 10⁻⁴ and a momentum of 0.9. The experiments were conducted on an NVIDIA GeForce RTX 3090 Graphics Processing Unit (GPU) with 24 GB of on-chip memory (Nvidia Corporation, Santa Clara, CA, USA), in a machine equipped with 64 GB of system memory. The single-precision floating-point capability of this GPU peaks at 36 TFLOPS. For implementation, we used the Keras DL framework with TensorFlow 2.9.1 as the backend.
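A minimal sketch of this training configuration is shown below. The loss function is an assumption, and `train_ds`/`val_ds` stand for hypothetical training and validation splits of the datasets built in the earlier sketches; Adam's default beta_1 = 0.9 matches the stated momentum.

```python
# Training configuration as reported: Adam with lr = 1e-4, 10 epochs
# (batch size 32 was set when the datasets were built).
model = build_greenvit_like()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",  # assumed loss choice
    metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```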

Evaluation Metrics
The proposed GreenViT model was assessed using several evaluation metrics, namely precision, recall, F1-score and accuracy, defined in the standard way from the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN):

Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1-score = 2 × (Precision × Recall)/(Precision + Recall).
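The same standard definitions in code form, as a small self-contained sketch (the example counts are illustrative, not results from the paper):

```python
def basic_metrics(tp: int, tn: int, fp: int, fn: int):
    # Standard classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative example: 95 TP, 90 TN, 5 FP, 10 FN.
print(basic_metrics(95, 90, 5, 10))
```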

Quantitative Results
This study compared the proposed GreenViT against various pre-trained CNN-based architectures for plant disease detection, evaluating parameter counts, precision, recall, F1-score and accuracy. Among the models examined, such as VGG19, VGG16, EfficientNetB0, MobileNetV1 and MobileNetV3Small, most demonstrated similar performance. The base ViT performed worst among the compared models, whereas the proposed GreenViT achieved superior accuracies of 100%, 98% and 99% on the PV, DRLI and PC datasets, respectively, while also exhibiting the lowest False Alarm Rate (FAR) among the SOTA models. Notably, when comparing the proposed GreenViT with MobileNetV1, both models are computationally efficient, but GreenViT showed a lower FAR and still outperformed it on all the included datasets. A detailed performance comparison of the employed models is listed in Table 3. The pre-trained models evidently achieve high performance with a low FAR; nevertheless, their FAR remains elevated and requires improvement. Consequently, this research refines and fine-tunes a Transformer-based architecture, GreenViT, with a focus on accuracy and on reducing incorrect predictions. Following fine-tuning, GreenViT demonstrates the best performance among the compared models, with fewer false predictions.

Furthermore, the performance of the proposed GreenViT was evaluated using 5-fold and 10-fold cross-validation on all the included datasets. The cross-validation accuracies show that GreenViT maintains competitive performance across all folds, with only a slight decrease in average test accuracy when the training samples in each fold are smaller than the whole dataset. This consistent performance demonstrates the robustness and reliability of GreenViT. Tables 4 and 5 list a comprehensive overview of the 5-fold and 10-fold cross-validation accuracies for each dataset, including the average test accuracy across the folds. These results reaffirm the effectiveness of GreenViT in handling diverse datasets and its ability to yield consistent, promising results in real-world applications.

Table 3. Quantitative evaluation of GreenViT in contrast to SOTA models using the included datasets. The proposed GreenViT model is highlighted in blue. The upward arrow (↑) indicates that a higher value is better.

Figure 2 illustrates the confusion matrices of the GreenViT method trained on the different benchmark datasets. The dark green diagonal corresponds to correct classifications (TP), with higher saturation indicating more accurate classification. The proposed GreenViT demonstrates superior overall classification accuracy compared with the SOTA models, although there are some misclassifications within both categories. The training accuracy and loss graphs are visualized in Figure 3, where the vertical axis represents accuracy or loss and the horizontal axis the number of epochs. It is evident from Figure 3 that GreenViT effectively detects plant diseases. As the number of training and validation iterations increases, the training and validation accuracy curves evolve as depicted in Figure 3a: the proposed GreenViT converges at seven epochs, achieving training and validation accuracies of 100%, 98% and 99% on the PV, DRLI and PC datasets, respectively. Similarly, the training and validation loss values decrease to 0.0 and 0.09, respectively, as depicted in Figure 3b.
In addition, Table 3 compares the suggested GreenViT with the other pre-trained models; the results indicate that the proposed GreenViT outperforms all of them.

Qualitative Results
We performed a visual analysis to assess the qualitative performance of the proposed GreenViT model in distinguishing infected plant images from healthy ones based on class activations. The results, shown in Figure 4, demonstrate the robustness of GreenViT in detecting diseased regions within a given input image. The objective of plant disease detection is to identify the presence of infections, yet certain leaf images present challenges even to the human eye and are not easily distinguishable without assistance; we addressed this by visually comparing GreenViT's performance across the datasets. Figure 4 showcases sample results for all three included datasets: the first row contains input images from the PV dataset, the second row images from the DRLI dataset and the third row images from the newly created PC dataset. The samples differ considerably in type, size, geometry and color scheme. The fourth row shows the ground truth (GT) labels for each input image, while the last row shows the labels predicted by the proposed GreenViT model. Infected samples are highlighted in red, while healthy samples are denoted in blue.

Time Complexity
In order to evaluate the effectiveness, performance and deployment suitability of a DL model, it is crucial to conduct real-time assessments on various devices, including small edge devices such as the Raspberry Pi 4 Model B+ (RPi 4B+), which relies on its Central Processing Unit (CPU). The RPi 4B+ features a quad-core 64-bit Cortex-A72 processor running at 1.5 GHz and 4 GB of main memory; the specifications of the desktop CPU used for comparison are given in Section 3. A frame rate of 30 Frames Per Second (FPS) or higher is generally considered optimal for real-world scenarios [38,45]. To assess the model's performance, the authors recorded a brief video of plants using a mobile phone. The FPS obtained for the proposed GreenViT model on the RPi 4B+ and on the desktop CPU are 0.34 and 22.19, respectively. Comparing the inference speed of the ViT-Base variant and the proposed GreenViT makes it evident that the proposed model performs more favorably and that the modified GreenViT is the more suitable option for edge devices. Overall, this comparison indicates that the execution speed of the newly proposed GreenViT method is satisfactory, and the model shows potential for real-time processing and operation.
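As an illustration, the sketch below shows one way such an FPS figure could be measured with the Keras model from the earlier sketches; the dummy-frame input and the timing loop are assumptions, not the authors' measurement protocol.

```python
import time
import numpy as np

def measure_fps(model, n_frames=100):
    # Rough FPS estimate: time repeated single-image inference on a dummy
    # 72 x 72 frame (a real test would feed frames decoded from video).
    frame = np.random.rand(1, 72, 72, 3).astype("float32")
    model(frame, training=False)  # warm-up call
    start = time.perf_counter()
    for _ in range(n_frames):
        model(frame, training=False)
    return n_frames / (time.perf_counter() - start)

print(f"{measure_fps(build_greenvit_like()):.2f} FPS")
```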

Table 6. An assessment of the proposed GreenViT FPS against several other DL models, comparing the parameter count and relative inference speed of each model. The proposed GreenViT model is highlighted in blue. The downward arrow (↓) indicates that a smaller value is better, while the upward arrow (↑) indicates that a higher value is better.

Conclusions
This study introduced a Transformer-based method for detecting plant diseases and infections that outperforms existing SOTA approaches. Additionally, to enhance its performance, the proposed GreenViT was fine-tuned, bringing the number of parameters down from 86 M to around 21.65 M. Three datasets, namely the PV, DRLI and PC datasets, were employed to evaluate the proposed GreenViT. The study also presents a comprehensive quantitative and qualitative analysis to demonstrate the model's generalization ability in real-world scenarios. To further validate the efficacy and efficiency of the proposed approach, future experiments will deploy the model on edge devices or drones and evaluate it on a wider variety of leaf diseases. In the context of intelligent edge devices, attention-based models remain a promising avenue for future exploration.